Search Results: Records 1-2 of 2 displayed on this page
Journal Articles

Optimization of fusion kernels on accelerators with indirect or strided memory access patterns

Asahi, Yuichi*; Latu, G.*; Ina, Takuya; Idomura, Yasuhiro; Grandgirard, V.*; Garbet, X.*

IEEE Transactions on Parallel and Distributed Systems, 28(7), p.1974 - 1988, 2017/07

Times Cited Count: 7; Percentile: 55.15 (Computer Science, Theory & Methods)

High-dimensional stencil computations from fusion plasma turbulence codes, which involve complex memory access patterns, namely the indirect memory access in a Semi-Lagrangian scheme and the strided memory access in a Finite-Difference scheme, are optimized on accelerators such as GPGPUs and Xeon Phi coprocessors. On both devices, the Array of Structures of Arrays (AoSoA) data layout is preferable for contiguous memory accesses. It is shown that effective local cache usage, achieved by improving spatial and temporal data locality, is critical on Xeon Phi. On GPGPUs, the use of texture memory improves the performance of the indirect memory accesses in the Semi-Lagrangian scheme. Thanks to these optimizations, the fusion kernels on accelerators become 1.4x-8.1x faster than those on Sandy Bridge (CPU).
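
The following is a minimal sketch of the AoSoA idea described in the abstract, not code from the paper: values are grouped into fixed-width tiles so that consecutive GPU threads (or SIMD lanes) touch consecutive addresses. The tile width VLEN, the type and kernel names, and the launch configuration are illustrative assumptions.

// Illustrative AoSoA (Array of Structures of Arrays) layout: each tile holds
// VLEN contiguous values per field, so lane i of a warp accesses tiles[t].f[i]
// and neighbouring lanes read neighbouring addresses (coalesced access).
// All names here (TileAoSoA, VLEN, scale) are assumptions for illustration.

#include <cuda_runtime.h>

constexpr int VLEN = 32;            // tile width, e.g. the warp size

struct ElemAoS   { double f; double df; };               // plain AoS: per-field access is strided
struct TileAoSoA { double f[VLEN]; double df[VLEN]; };   // AoSoA: per-field access is contiguous within a tile

__global__ void scale(TileAoSoA* tiles, int ntiles, double a) {
    int tile = blockIdx.x;          // one thread block per tile
    int lane = threadIdx.x;         // blockDim.x == VLEN
    if (tile < ntiles) {
        tiles[tile].f[lane] *= a;   // consecutive lanes -> consecutive addresses
    }
}

int main() {
    const int ntiles = 1024;
    TileAoSoA* d = nullptr;
    cudaMalloc(&d, ntiles * sizeof(TileAoSoA));
    cudaMemset(d, 0, ntiles * sizeof(TileAoSoA));
    scale<<<ntiles, VLEN>>>(d, ntiles, 2.0);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}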

Oral presentation

Acceleration of stencil-based fusion kernels

Asahi, Yuichi*; Latu, G.*; Ina, Takuya; Idomura, Yasuhiro; Grandgirard, V.*; Garbet, X.*

no journal

Computation kernels of fusion plasma turbulence codes based on the Semi-Lagrangian scheme and the Finite-Difference scheme are optimized on the latest many-core processors, namely GPGPU, Xeon Phi, and FX100, and a 1.4x-8.1x speedup is achieved. The affinity between the different memory access patterns of each numerical scheme and the different memory/cache architectures of each device is studied, and different optimization techniques are developed for each architecture. On Xeon Phi, thread load balance is improved, and an optimization technique for effective local cache usage is developed. On GPGPU, an optimization technique using texture memory and an implementation that reuses registers are developed. On FX100, on the other hand, it is found that the conventional optimization techniques for CPUs work. An illustrative sketch of the texture-memory technique follows below.
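
As a rough illustration of the texture-memory technique mentioned for GPGPU, the sketch below routes the indirect gathers of a 1-D semi-Lagrangian interpolation through the read-only (texture) cache via __ldg(). It is not the authors' kernel; all names (f_old, foot, w, nx) and the linear interpolation stencil are hypothetical.

// Hypothetical 1-D semi-Lagrangian update: the departure-point ("foot") index
// makes the reads of f_old indirect, so they are fetched with __ldg(), which
// goes through the read-only/texture cache on GPGPUs.

#include <cuda_runtime.h>

__global__ void semi_lagrangian_1d(const double* __restrict__ f_old,
                                   const int*    __restrict__ foot,  // departure-point index
                                   const double* __restrict__ w,     // interpolation weight
                                   double*       f_new, int nx) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nx) {
        int    j  = __ldg(&foot[i]);               // indirect index, read-only cached
        double a  = __ldg(&w[i]);
        double fl = __ldg(&f_old[j]);              // gathered through the texture cache
        double fr = __ldg(&f_old[(j + 1) % nx]);   // periodic neighbour
        f_new[i]  = (1.0 - a) * fl + a * fr;       // linear interpolation at the foot
    }
}

int main() {
    const int nx = 1 << 20;
    double *f_old, *f_new, *w;
    int* foot;
    cudaMalloc(&f_old, nx * sizeof(double));
    cudaMalloc(&f_new, nx * sizeof(double));
    cudaMalloc(&w,     nx * sizeof(double));
    cudaMalloc(&foot,  nx * sizeof(int));
    cudaMemset(f_old, 0, nx * sizeof(double));
    cudaMemset(w,     0, nx * sizeof(double));
    cudaMemset(foot,  0, nx * sizeof(int));        // dummy feet, just to run the kernel
    semi_lagrangian_1d<<<(nx + 255) / 256, 256>>>(f_old, foot, w, f_new, nx);
    cudaDeviceSynchronize();
    cudaFree(f_old); cudaFree(f_new); cudaFree(w); cudaFree(foot);
    return 0;
}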
